Combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type
This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. The program used to create the data simulates how people make purchasing decisions and how those decisions are influenced by promotional offers. Once every few days, Starbucks sends out an offer to users of the mobile app.
This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products. Only the amounts of each transaction or offer are recorded.
There are three types of offers that can be sent:

- bogo (buy one get one free)
- discount
- informational (an advertisement for a product, with no required spend and no reward)
Some users might not receive any offer during certain weeks, and not all users receive the same offer; that is the challenge to solve with this data set.
Each person in the simulation has some hidden traits that influence their purchasing patterns and are associated with their observable traits. People produce various events, including receiving offers, opening offers, and making purchases.
The goal is to combine transaction, demographic and offer data to:
determine which demographic groups respond best to which offer type
%reset -f
# %load_ext autoreload
# %autoreload 2
# must be called before importing matplotlib!!!
# %matplotlib widget
import pandas as pd
import numpy as np
import math
import json
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
import dataframe_image as dfi
sns.set(style='whitegrid')
plt.rcParams['font.family'] = 'monospace'
plt.rcParams['font.weight'] = 'bold'
plt.rcParams['font.size'] = 16
plt.rc('axes', labelsize=16)
load CLEANED DATA to go directly to ANALYSIS
# read in the json files
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)
profile

profile.json: Rewards program users (17000 users x 5 fields)
profile.info()
profile.head()
portfolio

portfolio.json: Offers sent during 30-day test period (10 offers x 6 fields)
portfolio.info()
Every offer has a validity period before the offer expires:
- A bogo offer might be valid for only 5 or 7 days.
- Informational offers also have a validity period even though these ads merely provide information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer feels the influence of the offer for 7 days after receiving the advertisement.

portfolio
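To make the validity window concrete, here is a minimal sketch of how an offer's expiry hour can be derived, assuming durations are given in days (as in portfolio.json) and event times in hours (as in transcript.json); the values are illustrative, not taken from the data:

```python
# Sketch: derive an offer's validity window in hours.
# Assumes `duration` is in days and event `time` is in hours;
# the numbers below are made up for illustration.
received_at = 168       # hour at which the offer was received
duration_days = 7       # validity period from the portfolio
valid_until = received_at + duration_days * 24
print(valid_until)      # hour at which the offer stops influencing the customer
```

Any transaction with a timestamp between `received_at` and `valid_until` falls inside the offer's influence window.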
transcript¶Transactional data showing user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user actually views the offer. There are also records for when a user completes an offer.
Keep in mind as well that someone using the app might make a purchase through the app without having received an offer or seen an offer.
transcript.json: Event log (306648 events x 4 fields)
transcript.info()
transcript.head()
event_type = transcript['event'].unique()
print(f'we found {len(event_type)} types of events:\n')
for evt in event_type:
    print(f'\t"{evt}"')
load CLEANED DATA to go directly to ANALYSIS
profile

Filtering out participants with missing values:

- age = 118
- income = NaN
- gender = None

profile = profile.query('age != 118')
profile.isnull().sum()
gender              0
age                 0
id                  0
became_member_on    0
income              0
dtype: int64
By filtering out participants with age = 118, we removed all missing values.
# we will be using id as the index
profile.set_index('id', inplace=True)
# change `became_member_on` to datetime
profile['became_member_on'] = pd.to_datetime(profile['became_member_on'], format='%Y%m%d')
# count number of unique users
n_users = len(profile.index.unique())
print(f'We found {n_users} unique participants in the profile dataset')
We found 14825 unique participants in the profile dataset
profile.head()
| gender | age | became_member_on | income | |
|---|---|---|---|---|
| id | ||||
| 0610b486422d4921ae7d2bf64640c50b | F | 55 | 2017-07-15 | 112000.0 |
| 78afa995795e4d85b5d9ceeca43f5fef | F | 75 | 2017-05-09 | 100000.0 |
| e2127556f4f64592b11af22de27a7932 | M | 68 | 2018-04-26 | 70000.0 |
| 389bc3fa690240e798340f5a15918d5c | M | 65 | 2018-02-09 | 53000.0 |
| 2eeac8d8feae4a8cad5a6af0499a211d | M | 58 | 2017-11-11 | 51000.0 |
portfolio

# we will be using id as the index
portfolio.set_index('id', inplace=True)
# create a short code for each offer: Type.Duration.Difficulty (e.g. 'B.05.10')
portfolio['code'] = portfolio.apply(
    lambda x: x['offer_type'][0].capitalize() + '.'
              + str(x['duration']).zfill(2) + '.'
              + str(x['difficulty']).zfill(2),
    axis=1,
)
portfolio.sort_values(by=['offer_type', 'difficulty', 'duration'])
| reward | channels | difficulty | duration | offer_type | code | |
|---|---|---|---|---|---|---|
| id | ||||||
| f19421c1d4aa40978ebb69ca19b0e20d | 5 | [web, email, mobile, social] | 5 | 5 | bogo | B.05.05 |
| 9b98b8c7a33c4b65b9aebfe6a799e6d9 | 5 | [web, email, mobile] | 5 | 7 | bogo | B.07.05 |
| 4d5c57ea9a6940dd891ad53e9dbe8da0 | 10 | [web, email, mobile, social] | 10 | 5 | bogo | B.05.10 |
| ae264e3637204a6fb9bb56bc8210ddfd | 10 | [email, mobile, social] | 10 | 7 | bogo | B.07.10 |
| 2298d6c36e964ae4a3e7e9706d1fb8c2 | 3 | [web, email, mobile, social] | 7 | 7 | discount | D.07.07 |
| 2906b810c7d4411798c6938adc9daaa5 | 2 | [web, email, mobile] | 10 | 7 | discount | D.07.10 |
| fafdcd668e3743c1bb461111dcafc2a4 | 2 | [web, email, mobile, social] | 10 | 10 | discount | D.10.10 |
| 0b1e1539f2cc45b7b9fa7c272da2e1d7 | 5 | [web, email] | 20 | 10 | discount | D.10.20 |
| 5a8bc65990b245e5a138643cd4eb9837 | 0 | [email, mobile, social] | 0 | 3 | informational | I.03.00 |
| 3f207df678b143eea3cee63160fa8bed | 0 | [web, email, mobile] | 0 | 4 | informational | I.04.00 |
# create a dummy variable per channel
channels = pd.DataFrame(index=portfolio.index)
for col in ['web', 'email', 'mobile', 'social']:
    channels[col] = portfolio['channels'].apply(lambda x: int(col in x))
channels
| web | email | mobile | social | |
|---|---|---|---|---|
| id | ||||
| ae264e3637204a6fb9bb56bc8210ddfd | 0 | 1 | 1 | 1 |
| 4d5c57ea9a6940dd891ad53e9dbe8da0 | 1 | 1 | 1 | 1 |
| 3f207df678b143eea3cee63160fa8bed | 1 | 1 | 1 | 0 |
| 9b98b8c7a33c4b65b9aebfe6a799e6d9 | 1 | 1 | 1 | 0 |
| 0b1e1539f2cc45b7b9fa7c272da2e1d7 | 1 | 1 | 0 | 0 |
| 2298d6c36e964ae4a3e7e9706d1fb8c2 | 1 | 1 | 1 | 1 |
| fafdcd668e3743c1bb461111dcafc2a4 | 1 | 1 | 1 | 1 |
| 5a8bc65990b245e5a138643cd4eb9837 | 0 | 1 | 1 | 1 |
| f19421c1d4aa40978ebb69ca19b0e20d | 1 | 1 | 1 | 1 |
| 2906b810c7d4411798c6938adc9daaa5 | 1 | 1 | 1 | 0 |
portfolio = portfolio.drop('channels', axis=1).join(channels.drop('email', axis=1))
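The same one-hot encoding of the channels lists can also be done without an explicit loop, using pandas string dummies. A minimal sketch (the offer ids here are made up for illustration):

```python
import pandas as pd

# Sketch: one-hot encode a column of channel lists without a loop.
portfolio = pd.DataFrame(
    {'channels': [['web', 'email', 'mobile'], ['email', 'social']]},
    index=['offer_a', 'offer_b'],
)

# join each list into 'web|email|mobile', then split it into dummy columns
dummies = portfolio['channels'].str.join('|').str.get_dummies()
print(dummies)
```

`str.get_dummies()` splits on `'|'` by default and returns the columns in alphabetical order, which is why the explicit loop above is handy when a fixed column order matters.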
transcript

Filtering transcript according to the participants available in the profile dataset

user_list = list(profile.index)
transcript = transcript.query('person in @user_list')
transcript.head()
| person | event | value | time | |
|---|---|---|---|---|
| 0 | 78afa995795e4d85b5d9ceeca43f5fef | offer received | {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} | 0 |
| 2 | e2127556f4f64592b11af22de27a7932 | offer received | {'offer id': '2906b810c7d4411798c6938adc9daaa5'} | 0 |
| 5 | 389bc3fa690240e798340f5a15918d5c | offer received | {'offer id': 'f19421c1d4aa40978ebb69ca19b0e20d'} | 0 |
| 7 | 2eeac8d8feae4a8cad5a6af0499a211d | offer received | {'offer id': '3f207df678b143eea3cee63160fa8bed'} | 0 |
| 8 | aa4862eba776480b8bb9c68455b8c2e1 | offer received | {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'} | 0 |
n_users_trans = len(transcript.person.unique())
print(f'We found {n_users_trans} unique participants in the transcript dataset')
We found 14825 unique participants in the transcript dataset
Since we will use the offers received to measure the impact of each offer, we will check that all participants received at least 1 offer.
user_received = transcript.query('event == "offer received"')['person'].unique()
no_offer_list = list(set(profile.index).difference(set(user_received))) # TO BE REMOVED
print(f'{len(no_offer_list)} participants did not receive any offer')
transcript.query('person in @no_offer_list')
5 participants did not receive any offer
| person | event | value | time | |
|---|---|---|---|---|
| 16379 | eb540099db834cf59001f83a4561aef3 | transaction | {'amount': 4.74} | 6 |
| 22092 | 12ede229379747bd8d74ccdc20097ca3 | transaction | {'amount': 16.62} | 18 |
| 69477 | c6e579c6821c41d1a7a6a9cf936e91bb | transaction | {'amount': 0.65} | 174 |
| 87239 | c6e579c6821c41d1a7a6a9cf936e91bb | transaction | {'amount': 1.9100000000000001} | 222 |
| 106812 | ae8111e7e8cd4b60a8d35c42c1110555 | transaction | {'amount': 17.51} | 306 |
| 108268 | eb540099db834cf59001f83a4561aef3 | transaction | {'amount': 5.09} | 318 |
| 109340 | 3a4874d8f0ef42b9a1b72294902afea9 | transaction | {'amount': 22.75} | 324 |
| 109815 | 12ede229379747bd8d74ccdc20097ca3 | transaction | {'amount': 27.6} | 324 |
| 110707 | ae8111e7e8cd4b60a8d35c42c1110555 | transaction | {'amount': 28.7} | 330 |
| 142347 | 3a4874d8f0ef42b9a1b72294902afea9 | transaction | {'amount': 27.2} | 378 |
| 163610 | c6e579c6821c41d1a7a6a9cf936e91bb | transaction | {'amount': 1.25} | 408 |
| 184609 | ae8111e7e8cd4b60a8d35c42c1110555 | transaction | {'amount': 11.53} | 444 |
| 228700 | eb540099db834cf59001f83a4561aef3 | transaction | {'amount': 7.4} | 528 |
| 243317 | c6e579c6821c41d1a7a6a9cf936e91bb | transaction | {'amount': 3.14} | 570 |
| 259644 | 3a4874d8f0ef42b9a1b72294902afea9 | transaction | {'amount': 17.67} | 576 |
| 269818 | eb540099db834cf59001f83a4561aef3 | transaction | {'amount': 8.78} | 594 |
| 287598 | ae8111e7e8cd4b60a8d35c42c1110555 | transaction | {'amount': 9.72} | 636 |
| 292894 | ae8111e7e8cd4b60a8d35c42c1110555 | transaction | {'amount': 18.97} | 654 |
| 292962 | 12ede229379747bd8d74ccdc20097ca3 | transaction | {'amount': 19.03} | 654 |
Since we will measure the impact of each offer through the transactions that occurred, we will check that all participants made at least 1 transaction.
user_transactions = transcript.query('event == "transaction"')['person'].unique()
no_transaction_list = list(set(profile.index).difference(set(user_transactions))) # TO BE REMOVED
print(f'{len(no_transaction_list)} participants did not make any transactions')
transcript.query('person in @no_transaction_list')
333 participants did not make any transactions
| person | event | value | time | |
|---|---|---|---|---|
| 60 | 13b7b2eccf664a329d83d2a238a9a11d | offer received | {'offer id': 'f19421c1d4aa40978ebb69ca19b0e20d'} | 0 |
| 85 | d41fa9b8c2e24a1d94caa890c53bbcf6 | offer received | {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} | 0 |
| 141 | accec36a10704e1fbaeb9b28b8bc5da8 | offer received | {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'} | 0 |
| 155 | 688e4f2487d44ad8bbcc0f96936f42d6 | offer received | {'offer id': '5a8bc65990b245e5a138643cd4eb9837'} | 0 |
| 160 | 4e5923e5deab4d5cb1624a0b1cc9909c | offer received | {'offer id': '2298d6c36e964ae4a3e7e9706d1fb8c2'} | 0 |
| ... | ... | ... | ... | ... |
| 302780 | 17360c8e66a840fb82441f6c0eb8bf47 | offer viewed | {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'} | 696 |
| 303730 | e9980e136f9e4f00a2e2c895907b0d82 | offer viewed | {'offer id': 'ae264e3637204a6fb9bb56bc8210ddfd'} | 702 |
| 304010 | 456f5d7fe4f14732b01a85b3a584ab82 | offer viewed | {'offer id': 'ae264e3637204a6fb9bb56bc8210ddfd'} | 702 |
| 304457 | 1aa4e69892454c308421ad7dacac366a | offer viewed | {'offer id': '3f207df678b143eea3cee63160fa8bed'} | 708 |
| 306450 | 9b51e8797290403b90d09d864dec4b94 | offer viewed | {'offer id': '3f207df678b143eea3cee63160fa8bed'} | 714 |
2617 rows × 4 columns
Filtering out participants with no offer received or no transactions made

participant_filter = list(set(no_offer_list + no_transaction_list))
transcript = transcript.query('person not in @participant_filter')
user_profile = set(profile.index)
user_received = set(transcript.query('event == "offer received"')['person'])
user_trans = set(transcript.query('event == "transaction"')['person'])
# user_list = list(user_received.intersection(user_trans).intersection(user_profile))
user_profile.intersection_update(user_received, user_trans)
len(user_profile)
14487
Expanding the value dictionary into amount, offer_id, and reward

def expand_transcript(transcript):
    # change 'offer id' into 'offer_id'
    value = transcript['value'].apply(
        lambda x: {('offer_id' if k == 'offer id' else k): v for k, v in x.items()}
    )
    # expand the dictionary into 3 columns and add them to transcript
    n_trans = transcript.join(pd.json_normalize(value))
    return n_trans.drop(columns=['value'])  # drop the obsolete column

n_trans = expand_transcript(transcript)
n_trans.head()
| person | event | time | offer_id | amount | reward | |
|---|---|---|---|---|---|---|
| 0 | 78afa995795e4d85b5d9ceeca43f5fef | offer received | 0 | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | NaN | NaN |
| 2 | e2127556f4f64592b11af22de27a7932 | offer received | 0 | f19421c1d4aa40978ebb69ca19b0e20d | NaN | NaN |
| 5 | 389bc3fa690240e798340f5a15918d5c | offer received | 0 | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | NaN | NaN |
| 7 | 2eeac8d8feae4a8cad5a6af0499a211d | offer received | 0 | 3f207df678b143eea3cee63160fa8bed | NaN | NaN |
| 8 | aa4862eba776480b8bb9c68455b8c2e1 | offer received | 0 | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | NaN | NaN |
In order to analyze the data we created two classes (see code.starbucks_class):
- Person: participant object
  - _id: participant id, as found in TRANSCRIPT (person) and PROFILE (id)
  - data: all data contained in TRANSCRIPT for this participant
  - offers: offer ids from received offers
  - total_spending: cumulative sum of all transactions
  - get_transaction() and get_reward(): collect all transactions or rewards between 2 timestamps
- Event: object for a received offer
  - _id: index of the received offer in TRANSCRIPT
  - offer_id: offer id, as found in PORTFOLIO
  - details: all details of the specific offer id, as found in PORTFOLIO (duration, difficulty, ...)
  - offer_type: offer type (bogo, discount, or informational)
  - start: time (in hours) when the offer was received
  - end: time (in hours) when the offer is supposed to end (based on offer duration)
  - events: subset of TRANSCRIPT from start to end
  - viewed: timestamp (in hours) if this offer was viewed, np.nan if not
  - completed: timestamp (in hours) if this offer was completed, np.nan if not

import numpy as np
import pandas as pd
from code.starbucks_class import Person, Event, PORTFOLIO, PROFILE, TRANSCRIPT
# gathering information from a participant included in TRANSCRIPT
user_list = TRANSCRIPT.person.unique()
user_id = user_list[5]
User = Person(user_id)
User.data
| person | event | time | offer_id | amount | reward | |
|---|---|---|---|---|---|---|
| 5 | 31dda685af34476cad5bc968bdb01c53 | offer received | 0 | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | NaN | NaN |
| 44475 | 31dda685af34476cad5bc968bdb01c53 | offer viewed | 150 | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | NaN | NaN |
| 46972 | 31dda685af34476cad5bc968bdb01c53 | offer received | 168 | fafdcd668e3743c1bb461111dcafc2a4 | NaN | NaN |
| 57733 | 31dda685af34476cad5bc968bdb01c53 | offer viewed | 168 | fafdcd668e3743c1bb461111dcafc2a4 | NaN | NaN |
| 84856 | 31dda685af34476cad5bc968bdb01c53 | transaction | 258 | NaN | 24.87 | NaN |
| 84857 | 31dda685af34476cad5bc968bdb01c53 | offer completed | 258 | fafdcd668e3743c1bb461111dcafc2a4 | NaN | 2.0 |
| 90493 | 31dda685af34476cad5bc968bdb01c53 | transaction | 288 | NaN | 16.26 | NaN |
| 93431 | 31dda685af34476cad5bc968bdb01c53 | transaction | 306 | NaN | 15.80 | NaN |
| 97797 | 31dda685af34476cad5bc968bdb01c53 | offer received | 336 | 2298d6c36e964ae4a3e7e9706d1fb8c2 | NaN | NaN |
| 108633 | 31dda685af34476cad5bc968bdb01c53 | offer viewed | 336 | 2298d6c36e964ae4a3e7e9706d1fb8c2 | NaN | NaN |
| 126311 | 31dda685af34476cad5bc968bdb01c53 | transaction | 384 | NaN | 12.57 | NaN |
| 126312 | 31dda685af34476cad5bc968bdb01c53 | offer completed | 384 | 2298d6c36e964ae4a3e7e9706d1fb8c2 | NaN | 3.0 |
| 132624 | 31dda685af34476cad5bc968bdb01c53 | offer received | 408 | 2298d6c36e964ae4a3e7e9706d1fb8c2 | NaN | NaN |
| 153141 | 31dda685af34476cad5bc968bdb01c53 | offer viewed | 426 | 2298d6c36e964ae4a3e7e9706d1fb8c2 | NaN | NaN |
| 153142 | 31dda685af34476cad5bc968bdb01c53 | transaction | 426 | NaN | 15.01 | NaN |
| 153143 | 31dda685af34476cad5bc968bdb01c53 | offer completed | 426 | 2298d6c36e964ae4a3e7e9706d1fb8c2 | NaN | 3.0 |
| 160509 | 31dda685af34476cad5bc968bdb01c53 | transaction | 444 | NaN | 23.18 | NaN |
| 177582 | 31dda685af34476cad5bc968bdb01c53 | offer received | 504 | 2298d6c36e964ae4a3e7e9706d1fb8c2 | NaN | NaN |
| 202611 | 31dda685af34476cad5bc968bdb01c53 | transaction | 534 | NaN | 22.87 | NaN |
| 202612 | 31dda685af34476cad5bc968bdb01c53 | offer completed | 534 | 2298d6c36e964ae4a3e7e9706d1fb8c2 | NaN | 3.0 |
| 204783 | 31dda685af34476cad5bc968bdb01c53 | transaction | 540 | NaN | 29.69 | NaN |
| 215789 | 31dda685af34476cad5bc968bdb01c53 | offer received | 576 | fafdcd668e3743c1bb461111dcafc2a4 | NaN | NaN |
| 230395 | 31dda685af34476cad5bc968bdb01c53 | offer viewed | 582 | fafdcd668e3743c1bb461111dcafc2a4 | NaN | NaN |
# gathering information about a received offer from User
Offer = Event(User.offers[0], User.data)
Offer.events
| person | event | time | offer_id | amount | reward | |
|---|---|---|---|---|---|---|
| 5 | 31dda685af34476cad5bc968bdb01c53 | offer received | 0 | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | NaN | NaN |
| 44475 | 31dda685af34476cad5bc968bdb01c53 | offer viewed | 150 | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | NaN | NaN |
| 46972 | 31dda685af34476cad5bc968bdb01c53 | offer received | 168 | fafdcd668e3743c1bb461111dcafc2a4 | NaN | NaN |
| 57733 | 31dda685af34476cad5bc968bdb01c53 | offer viewed | 168 | fafdcd668e3743c1bb461111dcafc2a4 | NaN | NaN |
from code.data_visualization import time_line
if True:
    User = Person('78afa995795e4d85b5d9ceeca43f5fef')
    Offer = Event(User.offers[0], User.data)
    f, ax = plt.subplots(len(User.offers), 1, sharex=True, figsize=[16, 12])
    plt.subplots_adjust(hspace=0.1)
    plt.rc('axes', labelsize=16)
    cnt = 0
    all_trans = User.get_transaction(0, User.data['time'].values[-1])
    for idx in User.offers:
        plt.sca(ax[cnt])
        legend = True if cnt == 0 else False
        xlabel = True if cnt == len(ax) - 1 else False
        Offer = Event(idx, User.data)
        trans = User.get_transaction(Offer.start, Offer.end)
        time_line(Offer, trans, legend=legend, xlabel=xlabel, text='#' + str(cnt + 1) + ' ')
        cnt += 1
    for ax_n in ax:
        ymax = ax_n.get_ylim()[1] * .9
        ax_n.vlines(
            x=all_trans.time,
            ymax=ymax, ymin=0,
            color='k',
            linestyle='dotted',
            lw=2,
            zorder=0
        )
        for label in (ax_n.get_xticklabels() + ax_n.get_yticklabels()):
            label.set_fontsize(16)
    offset = 2
    for n, x in enumerate(all_trans.time):
        ax[-1].text(
            x + offset, 0.01,
            str(n + 1),
            weight='bold',
            ha='left'
        )
    plt.show()

if True:
    f.savefig('./docs/assets/Timeline.png', dpi=150, transparent=True)
This visualization represents all the events that occured over 30 days (720 hours) of data collection for one participant.
Each row represents an offer received by this individual. The light blue block represents the duration of the offer. In this example, the participants received 4 offers: 3 bogo and 1 informational.
For each offer, 4 different events are marked:
The 7 transactions for this individual are shown across the offers as black dotted lines. The transactions are not specific to an offer, and a difficulty of this analysis was to assign transactions to specific offers.
A few considerations:
offers can overlap each other: offer #4 starts during offer #3. In this situation, transactions 6 and 7 could be assigned to both offer #3 and offer #4.
in this example, all offers have been viewed. This is not always the case. Moreover, transactions can be made before the offer is viewed (see transactions 6 and 7 for offer #4)
transactions can be made before, during, or after an offer. We can find transactions between 2 offers (see transaction 5)
not all offer types are presented to all participants. This will lead to differences in overall number of offers in each category.
the number of offers presented to an individual is not always 4 (see further analysis for details).
We decided to assign a transaction to an offer (marked with a white dot) if:
DEPENDENT VARIABLE
(TO BE EDITED)
import numpy as np
import pandas as pd
from code.starbucks_class import Person, Event, PORTFOLIO, PROFILE, TRANSCRIPT
result = []
for user_id in set(TRANSCRIPT[:1].person):
    User = Person(user_id)
    PROFILE.loc[User._id, 'total_spending'] = User.total_spending
    PROFILE.loc[User._id, 'total_offers'] = len(User.offers)
    for _id in User.offers:
        evt = Event(_id, User.data)
        trans = User.get_transaction(evt.viewed, evt.end)      # transactions after offer is viewed
        trans_all = User.get_transaction(evt.start, evt.end)   # all transactions
        reward = User.get_reward(evt.viewed, evt.end)          # reward after offer is viewed
        reward_all = User.get_reward(evt.start, evt.end)       # all rewards
        # results
        res_trans = np.sum(trans.amount) if len(trans) > 0 else 0  # total spent once offer viewed
        res_trans_avg = (res_trans / (min(evt.end, 720) - evt.viewed)) * 24  # spending per day once offer viewed
        res_reward = np.sum(reward.reward) if len(reward) > 0 else 0
        res_trans_all = np.sum(trans_all.amount) if len(trans_all) > 0 else 0
        res_reward_all = np.sum(reward_all.reward) if len(reward_all) > 0 else 0
        res_tag_viewed = 0 if np.isnan(evt.viewed) else 1  # tag offer viewed (viewed is a timestamp or np.nan)
        res_tag_completed = 1 if evt.completed >= evt.viewed else 0  # tag offer completed after viewed
        result.append([
            user_id,
            PORTFOLIO.loc[evt.offer_id, 'code'],
            evt.offer_type,
            res_trans_all,
            res_trans,
            res_trans_avg,
            res_reward_all,
            res_reward,
            res_tag_viewed,
            res_tag_completed
        ])
# results
RES = pd.DataFrame(result, columns=[
'profile_id',
'portfolio_code',
'portfolio_type',
'amount_received',
'amount_viewed',
'amount_viewed_avg',
'reward_received',
'reward_viewed',
'view_tag',
'complete_tag'
])
RES
| profile_id | portfolio_code | portfolio_type | amount_received | amount_viewed | amount_viewed_avg | reward_received | reward_viewed | view_tag | complete_tag | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 78afa995795e4d85b5d9ceeca43f5fef | B.07.05 | bogo | 37.67 | 37.67 | 5.580741 | 5.0 | 5.0 | 1 | 1 |
| 1 | 78afa995795e4d85b5d9ceeca43f5fef | I.03.00 | informational | 49.39 | 49.39 | 49.390000 | 0.0 | 0.0 | 1 | 0 |
| 2 | 78afa995795e4d85b5d9ceeca43f5fef | B.07.10 | bogo | 48.28 | 48.28 | 6.897143 | 15.0 | 15.0 | 1 | 1 |
| 3 | 78afa995795e4d85b5d9ceeca43f5fef | B.05.05 | bogo | 48.28 | 0.00 | 0.000000 | 15.0 | 0.0 | 1 | 0 |
PORTFOLIO
| reward | difficulty | duration | offer_type | code | chann_web | chann_mobile | chann_social | |
|---|---|---|---|---|---|---|---|---|
| id | ||||||||
| ae264e3637204a6fb9bb56bc8210ddfd | 10 | 10 | 7 | bogo | B.07.10 | 0 | 1 | 1 |
| 4d5c57ea9a6940dd891ad53e9dbe8da0 | 10 | 10 | 5 | bogo | B.05.10 | 1 | 1 | 1 |
| 3f207df678b143eea3cee63160fa8bed | 0 | 0 | 4 | informational | I.04.00 | 1 | 1 | 0 |
| 9b98b8c7a33c4b65b9aebfe6a799e6d9 | 5 | 5 | 7 | bogo | B.07.05 | 1 | 1 | 0 |
| 0b1e1539f2cc45b7b9fa7c272da2e1d7 | 5 | 20 | 10 | discount | D.10.20 | 1 | 0 | 0 |
| 2298d6c36e964ae4a3e7e9706d1fb8c2 | 3 | 7 | 7 | discount | D.07.07 | 1 | 1 | 1 |
| fafdcd668e3743c1bb461111dcafc2a4 | 2 | 10 | 10 | discount | D.10.10 | 1 | 1 | 1 |
| 5a8bc65990b245e5a138643cd4eb9837 | 0 | 0 | 3 | informational | I.03.00 | 0 | 1 | 1 |
| f19421c1d4aa40978ebb69ca19b0e20d | 5 | 5 | 5 | bogo | B.05.05 | 1 | 1 | 1 |
| 2906b810c7d4411798c6938adc9daaa5 | 2 | 10 | 7 | discount | D.07.10 | 1 | 1 | 0 |
SAVE RESULTS & UPDATED PROFILE
# save results and modified PROFILE to avoid running the analysis
from sqlalchemy import create_engine
if False:
    database_filename = './data/db_results.db'
    engine = create_engine('sqlite:///' + database_filename)
    RES.to_sql('Variables', engine, if_exists='replace', index=False)
    PROFILE.to_sql('Profile', con=engine, index=True, index_label='id', if_exists='replace')
from code.data_wrangling import load_from_db
RES, PROFILE = load_from_db()
TARGET & FEATURES
TARGETS:
RES.head()
| profile_id | portfolio_code | portfolio_type | amount_received | amount_viewed | reward_received | reward_viewed | view_tag | complete_tag | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 13bbddb500e54ab392c8da22ae7a41cd | D.07.10 | discount | 40.38 | 40.38 | 2.0 | 2.0 | 1 | 1 |
| 1 | 13bbddb500e54ab392c8da22ae7a41cd | D.10.10 | discount | 75.31 | 58.36 | 12.0 | 10.0 | 1 | 0 |
| 2 | 13bbddb500e54ab392c8da22ae7a41cd | B.07.10 | bogo | 54.04 | 0.00 | 20.0 | 0.0 | 1 | 0 |
| 3 | 13bbddb500e54ab392c8da22ae7a41cd | B.05.10 | bogo | 40.02 | 10.44 | 10.0 | 0.0 | 1 | 0 |
| 4 | 0d15aeb66e044f26b1a50c678cef7fe4 | B.05.05 | bogo | 0.00 | 0.00 | 0.0 | 0.0 | 0 | 0 |
pie_type = RES['portfolio_type'].value_counts(normalize=True)
f,ax=plt.subplots(figsize=[16,8])
plt.pie(
    pie_type,
    labels=pie_type.index,
    autopct='%.2f'
)
plt.show()
if False:
    # check DFI in import
    pie_type.to_frame().style.format("{:.2%}").export_png('./docs/assets/offer_dist.png')
In this section we will look at who is viewing which offers, trying first to identify whether a subset of the population does not respond to all offers, or only to some specific ones.
# grouping the results by participants and computing viewing rate per offer type
y_view = RES.groupby(['profile_id', 'portfolio_code'])['view_tag'].mean().unstack()
y_view.head()
| portfolio_code | B.05.05 | B.05.10 | B.07.05 | B.07.10 | D.07.07 | D.07.10 | D.10.10 | D.10.20 | I.03.00 | I.04.00 |
|---|---|---|---|---|---|---|---|---|---|---|
| profile_id | ||||||||||
| 0009655768c64bdeb2e877511632db8f | 1.0 | NaN | NaN | NaN | NaN | 1.0 | 1.0 | NaN | 1.0 | 1.0 |
| 0011e0d4e6b944f998e987f904e8c1e5 | NaN | NaN | 1.0 | NaN | 1.0 | NaN | NaN | 1.0 | 1.0 | 1.0 |
| 0020c2b971eb4e9188eac86d93036a77 | NaN | 1.0 | NaN | 1.0 | NaN | NaN | 1.0 | NaN | 1.0 | NaN |
| 0020ccbbb6d84e358d3414a3ff76cffd | 1.0 | NaN | 1.0 | NaN | 1.0 | NaN | NaN | NaN | 1.0 | NaN |
| 003d66b6608740288d6cc97a6903f4f0 | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 1.0 | 1.0 | 1.0 |
# missing data rate per offer type
y_view.isna().mean()
portfolio_code
B.05.05    0.633050
B.05.10    0.627321
B.07.05    0.625802
B.07.10    0.624077
D.07.07    0.627252
D.07.10    0.631601
D.10.10    0.626424
D.10.20    0.623732
I.03.00    0.626631
I.04.00    0.627528
dtype: float64
The quasi-equality of the missing-data rates ($\approx 63\%$) is probably an artifact of this dataset being simulated.
# average viewing rate per offer type
y_view.mean()
portfolio_code
B.05.05    0.960224
B.05.10    0.959344
B.07.05    0.986857
B.07.10    0.973421
D.07.07    0.964074
D.07.10    0.986712
D.10.10    0.954669
D.10.20    0.994955
I.03.00    0.969649
I.04.00    0.987120
dtype: float64
Overall, each offer type shows the same rate of response: $\approx 97\%$ of each offer type is viewed.
# histogram of the viewing rate per participant
y_view.mean(axis=1).hist()
plt.ylabel('participants')
plt.xlabel('viewing rate')
Text(0.5, 0, 'viewing rate')
y_view.mean(axis=1).value_counts(normalize=True).sort_index()
0.500000    0.003797
0.666667    0.017395
0.750000    0.034031
0.800000    0.023331
0.833333    0.001104
0.833333    0.015462
0.875000    0.014427
0.888889    0.002347
0.900000    0.004211
0.916667    0.002071
1.000000    0.881825
dtype: float64
$88\%$ of the participants viewed all the offers presented to them, and over $99\%$ viewed over $66\%$ of the offers presented.
All customers viewed at least $50\%$ of the offers presented to them
We conclude that we do not have a response problem with any offer type.
Looking at the offers completed after being viewed, we computed the average completion rate per offer type (more than one offer of each type could be presented to each customer).
# average completion (after being viewed) per offer type
# we computed the average since more than one offer of each kind can be presented
y_complete = RES.groupby(['profile_id', 'portfolio_type'])['complete_tag'].mean().unstack()
y_complete.head()
| portfolio_type | bogo | discount | informational |
|---|---|---|---|
| profile_id | |||
| 0009655768c64bdeb2e877511632db8f | 0.0 | 0.000000 | 0.0 |
| 0011e0d4e6b944f998e987f904e8c1e5 | 1.0 | 1.000000 | 0.0 |
| 0020c2b971eb4e9188eac86d93036a77 | 0.5 | 0.500000 | 0.0 |
| 0020ccbbb6d84e358d3414a3ff76cffd | 1.0 | 1.000000 | 0.0 |
| 003d66b6608740288d6cc97a6903f4f0 | NaN | 0.666667 | 0.0 |
# conversion rate
y_complete.mean()
portfolio_type
bogo             0.414981
discount         0.461080
informational    0.000000
dtype: float64
f,ax =plt.subplots(1,2,sharey=True, figsize=[16,8])
plt.subplots_adjust(wspace=.1)
y_complete['bogo'].hist(ax=ax[0])
y_complete['discount'].hist(ax=ax[1], bins=11)
ax[0].set_ylabel('participants')
ax[0].set_xlabel('completion rate')
ax[0].set_title('bogo', weight='bold', fontsize=16)
ax[1].set_xlabel('completion rate')
ax[1].set_title('discount', weight='bold', fontsize=16)
plt.show()
For the rest of this analysis we will consider success as a completion rate of $50\%$ or higher.
# LABEL SUCCESSFUL CONVERSION
# tag success (completion rate >= 0.5) as 1, failure as 0; keep missing values as None
y_complete = y_complete.applymap(lambda x: None if x != x else int(x >= 0.5))
JOINING FEATURES AND TARGETS
X = PROFILE.loc[y_complete.index,:]
X.head()
| gender | age | became_member_on | income | total_spending | total_offers | |
|---|---|---|---|---|---|---|
| profile_id | ||||||
| 0009655768c64bdeb2e877511632db8f | M | 33 | 2017-04-21 | 72000.0 | 127.60 | 5.0 |
| 0011e0d4e6b944f998e987f904e8c1e5 | O | 40 | 2018-01-09 | 57000.0 | 79.46 | 5.0 |
| 0020c2b971eb4e9188eac86d93036a77 | F | 59 | 2016-03-04 | 90000.0 | 196.86 | 5.0 |
| 0020ccbbb6d84e358d3414a3ff76cffd | F | 24 | 2016-11-11 | 60000.0 | 154.05 | 4.0 |
| 003d66b6608740288d6cc97a6903f4f0 | F | 26 | 2017-06-21 | 73000.0 | 48.34 | 5.0 |
df = X.join(y_complete.drop('informational', axis=1))
df.gender = df.gender.astype('category')
df.head()
| gender | age | became_member_on | income | total_spending | total_offers | bogo | discount | |
|---|---|---|---|---|---|---|---|---|
| profile_id | ||||||||
| 0009655768c64bdeb2e877511632db8f | M | 33 | 2017-04-21 | 72000.0 | 127.60 | 5.0 | 0.0 | 0.0 |
| 0011e0d4e6b944f998e987f904e8c1e5 | O | 40 | 2018-01-09 | 57000.0 | 79.46 | 5.0 | 1.0 | 1.0 |
| 0020c2b971eb4e9188eac86d93036a77 | F | 59 | 2016-03-04 | 90000.0 | 196.86 | 5.0 | 1.0 | 1.0 |
| 0020ccbbb6d84e358d3414a3ff76cffd | F | 24 | 2016-11-11 | 60000.0 | 154.05 | 4.0 | 1.0 | 1.0 |
| 003d66b6608740288d6cc97a6903f4f0 | F | 26 | 2017-06-21 | 73000.0 | 48.34 | 5.0 | NaN | 1.0 |
CORRELATION MATRIX
df.corr().style.background_gradient(cmap='Blues').format("{:.2f}")
| age | income | total_spending | total_offers | bogo | discount | |
|---|---|---|---|---|---|---|
| age | 1.00 | 0.31 | 0.11 | -0.01 | 0.10 | 0.06 |
| income | 0.31 | 1.00 | 0.33 | -0.01 | 0.22 | 0.13 |
| total_spending | 0.11 | 0.33 | 1.00 | 0.09 | 0.37 | 0.27 |
| total_offers | -0.01 | -0.01 | 0.09 | 1.00 | 0.00 | -0.02 |
| bogo | 0.10 | 0.22 | 0.37 | 0.00 | 1.00 | 0.22 |
| discount | 0.06 | 0.13 | 0.27 | -0.02 | 0.22 | 1.00 |
We originally recorded the number of offers and the total spending because we thought that the former might influence the latter. Yet, these 2 variables show a very weak correlation ($r=0.09$).
However, total_spending shows the highest correlation with the completion of the bogo and discount offers.
$INCOME = f(AGE)$
size_dict = {
    'M': 1,
    'O': 2,
    'F': 5
}
mkr_dict = {
    'M': "o",
    'O': "d",
    'F': "v"
}
size = [size_dict[k] for k in df.gender]
mkr = [mkr_dict[k] for k in df.gender]
offer=['bogo', 'discount']
# offer=['bogo'] # for github page
f_size = [20,10] if len(offer) > 1 else [10,10]
labels = ['not completed', 'completed', 'M', 'O', 'F']
age_breaks = [36, 48, 75]
inc_breaks = [50000, 75000 ,100000]
f, ax = plt.subplots(1,len(offer), figsize=f_size, sharey=True)
plt.subplots_adjust(wspace=.05)
for i, col in enumerate(offer):
subax = np.array(ax,ndmin=1) if len(offer) == 1 else ax
sns.scatterplot(data=df, x='age', y='income', hue=col, ax=subax[i], palette='colorblind', size=size)
subax[i].vlines(
x=age_breaks,
ymin=df.income.min(),
ymax=df.income.max(),
ls='dashed',
lw=4,
color='k',
alpha=0.6
)
subax[i].hlines(
y=inc_breaks,
xmin=df.age.min(),
xmax=df.age.max(),
ls='dashed',
lw=4,
color='k',
alpha=0.6
)
handles, _ = subax[i].get_legend_handles_labels()
subax[i].legend(handles, labels)
subax[i].set_title(col, weight='bold', fontsize=16)
if False:
f.savefig('./docs/assets/AgeIncome.png', dpi=150, transparent=True)
From this representation we can clearly identify 2 income breaks, at 75k and 100k. From the groups that completed the bogo and discount offers we can identify a $3^{rd}$ income break at 50k. We will therefore split the income into 4 brackets: <50k, 50k-74k, 75k-99k, >100k.
Accordingly, we can identify 4 age brackets: <36, 36-47, 48-74, >75.
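The bracketing itself is done later by a helper in the project's code module; a minimal sketch of how it could be built with pandas' `pd.cut`, using the bin edges read off the plot above (the outer edges 0, 200 and 10 million are arbitrary sentinels):

```python
import pandas as pd

# bin edges read off the scatter plot above; outer edges are sentinels
age_bins, age_labels = [0, 36, 48, 75, 200], ['<36', '36-47', '48-74', '>75']
inc_bins, inc_labels = [0, 50_000, 75_000, 100_000, 10**7], ['<50k', '50k-74k', '75k-99k', '>100k']

customers = pd.DataFrame({'age': [33, 40, 59, 80],
                          'income': [72_000, 57_000, 90_000, 110_000]})
customers['age_brackets'] = pd.cut(customers.age, bins=age_bins,
                                   labels=age_labels, right=False)
customers['income_brackets'] = pd.cut(customers.income, bins=inc_bins,
                                      labels=inc_labels, right=False)
```

With `right=False` each bin is closed on the left, so an age of 48 falls in the 48-74 bracket and an income of exactly 75k falls in 75k-99k.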
$AGE = f(MEMBERSHIP)$
from datetime import datetime
date_breaks = [
datetime(year=2015, month=8, day=1),
datetime(year=2017, month=8, day=1)
]
f, ax = plt.subplots(1,len(offer), figsize=f_size, sharey=True)
plt.subplots_adjust(wspace=.05)
for i, col in enumerate(offer):
subax = np.array(ax,ndmin=1) if len(offer) == 1 else ax
sns.scatterplot(data=df, x='became_member_on', y='age', hue=col, ax=subax[i], palette='colorblind', size=size)
subax[i].vlines(
x=date_breaks,
ymin=df.age.min(),
ymax=df.age.max(),
ls='dashed',
color='k',
lw=4,
alpha=0.6
)
subax[i].hlines(
y=age_breaks,
xmin=df.became_member_on.min(),
xmax=df.became_member_on.max(),
ls='dashed',
lw=4,
color='k',
alpha=0.6
)
handles, _ = subax[i].get_legend_handles_labels()
subax[i].legend(handles, labels)
subax[i].set_title(col, weight='bold', fontsize=16)
if False:
f.savefig('./docs/assets/AgeMember.png', dpi=150, transparent=True)
This representation shows 3 clear membership periods: 7-2013 to 8-2015, 8-2015 to 8-2017, and 8-2017 to 7-2018.
It looks like many women of all ages who became members during period 2 (8-2015 to 8-2017) are very likely to complete both the bogo and the discount offers.
$INCOME = f(MEMBERSHIP)$
f, ax = plt.subplots(1,len(offer), figsize=f_size, sharey=True)
plt.subplots_adjust(wspace=.05)
for i, col in enumerate(offer):
subax = np.array(ax,ndmin=1) if len(offer) == 1 else ax
sns.scatterplot(data=df, x='became_member_on', y='income', hue=col, ax=subax[i], palette='colorblind', size=size)
subax[i].vlines(
x=date_breaks,
ymin=df.income.min(),
ymax=df.income.max(),
ls='dashed',
lw=4,
color='k',
alpha=0.6
)
subax[i].hlines(
y=inc_breaks,
xmin=df.became_member_on.min(),
xmax=df.became_member_on.max(),
ls='dashed',
lw=4,
color='k',
alpha=0.6
)
handles, _ = subax[i].get_legend_handles_labels()
subax[i].legend(handles, labels)
subax[i].set_title(col, weight='bold', fontsize=16)
if False:
f.savefig('./docs/assets/MemberIncome.png', dpi=150, transparent=True)
from code.data_wrangling import create_features
feat = create_features(X)
feat.head(10)
| age_brackets | membership | income_brackets | gender | |
|---|---|---|---|---|
| profile_id | ||||
| 0009655768c64bdeb2e877511632db8f | <36 | 8-2015 to 8-2017 | 50k-74k | M |
| 0011e0d4e6b944f998e987f904e8c1e5 | 36-47 | 8-2017 to 7-2018 | 50k-74k | O |
| 0020c2b971eb4e9188eac86d93036a77 | 48-74 | 8-2015 to 8-2017 | 75k-99k | F |
| 0020ccbbb6d84e358d3414a3ff76cffd | <36 | 8-2015 to 8-2017 | 50k-74k | F |
| 003d66b6608740288d6cc97a6903f4f0 | <36 | 8-2015 to 8-2017 | 50k-74k | F |
| 00426fe3ffde4c6b9cb9ad6d077a13ea | <36 | 8-2015 to 8-2017 | 50k-74k | F |
| 004b041fbfe44859945daa2c7f79ee64 | 48-74 | 8-2017 to 7-2018 | 50k-74k | F |
| 004c5799adbf42868b9cff0396190900 | 48-74 | 8-2015 to 8-2017 | 75k-99k | M |
| 005500a7188546ff8a767329a2f7c76a | 48-74 | 8-2017 to 7-2018 | <50k | M |
| 0056df74b63b4298809f0b375a304cf4 | 48-74 | 8-2015 to 8-2017 | 75k-99k | M |
# Joining conversion results with the segmented demographics
df_feat = feat.join(y_complete)
df_feat.head()
| age_brackets | membership | income_brackets | gender | bogo | discount | informational | |
|---|---|---|---|---|---|---|---|
| profile_id | |||||||
| 0009655768c64bdeb2e877511632db8f | <36 | 8-2015 to 8-2017 | 50k-74k | M | 0.0 | 0.0 | 0.0 |
| 0011e0d4e6b944f998e987f904e8c1e5 | 36-47 | 8-2017 to 7-2018 | 50k-74k | O | 1.0 | 1.0 | 0.0 |
| 0020c2b971eb4e9188eac86d93036a77 | 48-74 | 8-2015 to 8-2017 | 75k-99k | F | 1.0 | 1.0 | 0.0 |
| 0020ccbbb6d84e358d3414a3ff76cffd | <36 | 8-2015 to 8-2017 | 50k-74k | F | 1.0 | 1.0 | 0.0 |
| 003d66b6608740288d6cc97a6903f4f0 | <36 | 8-2015 to 8-2017 | 50k-74k | F | NaN | 1.0 | 0.0 |
TOTAL SPENDING PER GROUP
df_feat = df_feat.join(df.total_spending)
# spending per category
spending_cat = df_feat\
.groupby(['age_brackets', 'membership', 'income_brackets', 'gender'])['total_spending']\
.sum()\
.reset_index()
spending_cat.head()
| age_brackets | membership | income_brackets | gender | total_spending | |
|---|---|---|---|---|---|
| 0 | <36 | 7-2013 to 8-2015 | <50k | F | 1162.71 |
| 1 | <36 | 7-2013 to 8-2015 | <50k | M | 6091.33 |
| 2 | <36 | 7-2013 to 8-2015 | <50k | O | 39.49 |
| 3 | <36 | 7-2013 to 8-2015 | 50k-74k | F | 1834.93 |
| 4 | <36 | 7-2013 to 8-2015 | 50k-74k | M | 7130.51 |
spending_cat.describe().style.format("{:,.2f}")
| total_spending | |
|---|---|
| count | 144.00 |
| mean | 12,045.82 |
| std | 22,829.61 |
| min | 0.00 |
| 25% | 143.85 |
| 50% | 1,938.09 |
| 75% | 15,062.94 |
| max | 151,850.52 |
RESULTS TABLE
res_bogo = df_feat\
.groupby(['age_brackets', 'membership', 'income_brackets', 'gender'])['bogo']\
.mean()\
.reset_index()
res_bogo = res_bogo.rename(columns={'bogo':'conversion_bogo'})
res_discount = df_feat\
.groupby(['age_brackets', 'membership', 'income_brackets', 'gender'])['discount']\
.mean()\
.reset_index()
res_discount = res_discount.rename(columns={'discount':'conversion_discount'})
# Results table
ON = ['age_brackets', 'membership', 'income_brackets', 'gender']
results = res_bogo.merge(res_discount, on=ON, suffixes=('_bogo', '_discount')).merge(spending_cat, on=ON)
results
# res_bogo.merge(spending_cat, on=['age_brackets', 'membership', 'income_brackets', 'gender'])
| age_brackets | membership | income_brackets | gender | conversion_bogo | conversion_discount | total_spending | |
|---|---|---|---|---|---|---|---|
| 0 | <36 | 7-2013 to 8-2015 | <50k | F | 0.125000 | 0.521739 | 1162.71 |
| 1 | <36 | 7-2013 to 8-2015 | <50k | M | 0.229167 | 0.560000 | 6091.33 |
| 2 | <36 | 7-2013 to 8-2015 | <50k | O | 0.000000 | 0.000000 | 39.49 |
| 3 | <36 | 7-2013 to 8-2015 | 50k-74k | F | 0.448276 | 0.633333 | 1834.93 |
| 4 | <36 | 7-2013 to 8-2015 | 50k-74k | M | 0.343750 | 0.617647 | 7130.51 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 139 | >75 | 8-2017 to 7-2018 | 75k-99k | M | 0.535211 | 0.592105 | 12564.45 |
| 140 | >75 | 8-2017 to 7-2018 | 75k-99k | O | 0.666667 | 1.000000 | 479.55 |
| 141 | >75 | 8-2017 to 7-2018 | >100k | F | 0.551724 | 0.469388 | 9761.93 |
| 142 | >75 | 8-2017 to 7-2018 | >100k | M | 0.571429 | 0.500000 | 2998.71 |
| 143 | >75 | 8-2017 to 7-2018 | >100k | O | NaN | NaN | 0.00 |
144 rows × 7 columns
results.sort_values('total_spending', ascending=False)[:10].style\
.format(
{
'conversion_bogo': "{:.2%}",
'conversion_discount': "{:.2%}",
'total_spending': "$ {:,.2f}"
}
)
| age_brackets | membership | income_brackets | gender | conversion_bogo | conversion_discount | total_spending | |
|---|---|---|---|---|---|---|---|
| 90 | 48-74 | 8-2015 to 8-2017 | 75k-99k | F | 68.49% | 68.04% | $ 151,850.52 |
| 91 | 48-74 | 8-2015 to 8-2017 | 75k-99k | M | 73.08% | 73.88% | $ 104,438.12 |
| 87 | 48-74 | 8-2015 to 8-2017 | 50k-74k | F | 77.00% | 74.03% | $ 103,213.77 |
| 88 | 48-74 | 8-2015 to 8-2017 | 50k-74k | M | 66.16% | 71.38% | $ 100,064.15 |
| 93 | 48-74 | 8-2015 to 8-2017 | >100k | F | 67.80% | 58.80% | $ 68,123.89 |
| 100 | 48-74 | 8-2017 to 7-2018 | 50k-74k | M | 33.84% | 40.54% | $ 67,380.78 |
| 102 | 48-74 | 8-2017 to 7-2018 | 75k-99k | F | 48.25% | 53.54% | $ 67,294.50 |
| 99 | 48-74 | 8-2017 to 7-2018 | 50k-74k | F | 51.56% | 51.76% | $ 57,019.08 |
| 103 | 48-74 | 8-2017 to 7-2018 | 75k-99k | M | 45.71% | 53.37% | $ 51,573.69 |
| 52 | 36-47 | 8-2015 to 8-2017 | 50k-74k | M | 72.20% | 72.29% | $ 45,660.70 |
bogo
results.sort_values('conversion_bogo', ascending=False)[:10]\
.style.format(
{
'conversion_bogo': "{:.2%}",
'conversion_discount': "{:.2%}",
'total_spending': "$ {:,.2f}"
}
)
| age_brackets | membership | income_brackets | gender | conversion_bogo | conversion_discount | total_spending | |
|---|---|---|---|---|---|---|---|
| 74 | 48-74 | 8-2015 to 8-2017 | >100k | O | 100.00% | 100.00% | $ 121.30 |
| 32 | 36-47 | 8-2015 to 8-2017 | <50k | O | 100.00% | 100.00% | $ 487.64 |
| 62 | 48-74 | 7-2013 to 8-2015 | >100k | M | 100.00% | 100.00% | $ 166.93 |
| 51 | 36-47 | 8-2017 to 7-2018 | >100k | F | 100.00% | 100.00% | $ 155.14 |
| 41 | 36-47 | 8-2015 to 8-2017 | >100k | O | 100.00% | 100.00% | $ 262.28 |
| 39 | 36-47 | 8-2015 to 8-2017 | >100k | F | 100.00% | 100.00% | $ 1,693.44 |
| 60 | 48-74 | 7-2013 to 8-2015 | 75k-99k | O | 100.00% | 100.00% | $ 325.54 |
| 8 | <36 | 8-2015 to 8-2017 | <50k | O | 100.00% | 66.67% | $ 331.37 |
| 71 | 48-74 | 8-2015 to 8-2017 | 75k-99k | O | 94.12% | 100.00% | $ 4,307.54 |
| 36 | 36-47 | 8-2015 to 8-2017 | 75k-99k | F | 89.86% | 90.91% | $ 18,230.87 |
diff_conversion = np.abs(results.conversion_bogo.sub(results.conversion_discount))
idx_diff = diff_conversion.sort_values(ascending=False).index.to_list()
results.loc[idx_diff[:10],:]\
.style.format(
{
'conversion_bogo': "{:.2%}",
'conversion_discount': "{:.2%}",
'total_spending': "$ {:,.2f}"
}
)
| age_brackets | membership | income_brackets | gender | conversion_bogo | conversion_discount | total_spending | |
|---|---|---|---|---|---|---|---|
| 35 | 36-47 | 8-2015 to 8-2017 | 50k-74k | O | 50.00% | 100.00% | $ 1,149.41 |
| 53 | 48-74 | 7-2013 to 8-2015 | <50k | F | 7.69% | 51.85% | $ 1,721.90 |
| 0 | <36 | 7-2013 to 8-2015 | <50k | F | 12.50% | 52.17% | $ 1,162.71 |
| 87 | >75 | 7-2013 to 8-2015 | <50k | M | 5.56% | 42.11% | $ 1,026.15 |
| 23 | 36-47 | 7-2013 to 8-2015 | <50k | M | 18.75% | 52.31% | $ 3,115.66 |
| 8 | <36 | 8-2015 to 8-2017 | <50k | O | 100.00% | 66.67% | $ 331.37 |
| 110 | >75 | 8-2017 to 7-2018 | 75k-99k | O | 66.67% | 100.00% | $ 479.55 |
| 61 | 48-74 | 7-2013 to 8-2015 | >100k | F | 83.33% | 50.00% | $ 2,318.40 |
| 40 | 36-47 | 8-2015 to 8-2017 | >100k | M | 66.67% | 100.00% | $ 867.38 |
| 12 | <36 | 8-2015 to 8-2017 | 75k-99k | F | 66.67% | 33.33% | $ 1,468.14 |
from code.data_wrangling import find_best_offer
age = 43
member = '2017-09-01'
income = 63000
gender = 'M'
find_best_offer(results, age=43, member=member, income=income, gender='M')
| age_brackets | membership | income_brackets | gender | conversion_bogo | conversion_discount | total_spending | |
|---|---|---|---|---|---|---|---|
| 46 | 36-47 | 8-2017 to 7-2018 | 50k-74k | M | 0.3 | 0.386228 | 23683.4 |
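find_best_offer comes from code.data_wrangling, so its internals are not shown here; below is a hedged sketch of how such a lookup could work. The bracket edges and helper names are my assumptions, not the project's actual code:

```python
import pandas as pd

# bracket edges read off the earlier plots (assumed, not the project's code)
AGE_BINS, AGE_LABELS = [0, 36, 48, 75, 200], ['<36', '36-47', '48-74', '>75']
INC_BINS, INC_LABELS = [0, 50_000, 75_000, 100_000, 10**7], ['<50k', '50k-74k', '75k-99k', '>100k']

def membership_bracket(member):
    # membership periods identified on the AGE = f(MEMBERSHIP) plot
    d = pd.Timestamp(member)
    if d < pd.Timestamp('2015-08-01'):
        return '7-2013 to 8-2015'
    if d < pd.Timestamp('2017-08-01'):
        return '8-2015 to 8-2017'
    return '8-2017 to 7-2018'

def find_best_offer_sketch(results, age, member, income, gender):
    # bucket the customer's attributes, then look up the matching group
    age_b = pd.cut([age], AGE_BINS, labels=AGE_LABELS, right=False)[0]
    inc_b = pd.cut([income], INC_BINS, labels=INC_LABELS, right=False)[0]
    mem_b = membership_bracket(member)
    return results.query('age_brackets == @age_b and income_brackets == @inc_b '
                         'and membership == @mem_b and gender == @gender')
```

Comparing conversion_bogo and conversion_discount on the returned row then tells which offer type performs better for that customer's group.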
Based on the conversion of the bogo and discount offers, I am testing a continuous metric that could be better at quantifying the performance of each offer type. I am using the amount_viewed, which is the sum of all transactions that occurred from the moment an offer is viewed until the end of the duration of that offer.
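As a sketch of how such an amount could be computed per viewed offer (column and event names below are assumptions modeled on the transcript data, not the project's exact wrangling code):

```python
import pandas as pd

def amount_viewed_sketch(events, view_time, end_time):
    # sum transaction amounts from the moment the offer is viewed
    # until the end of the offer's validity window
    window = events[(events.event == 'transaction')
                    & (events.time >= view_time)
                    & (events.time <= end_time)]
    return window.amount.sum()
```

For example, with transactions at t=10, 50 and 200 hours and an offer viewed at t=6 that expires at t=168, only the first two transactions would count toward amount_viewed.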
First, I try to filter our population to make it more coherent by checking for outliers in the total_spending and the total_offers (total number of offers received by each customer).
f, ax = plt.subplots(ncols=3, figsize=[16,8])
sns.histplot(data=PROFILE, y='total_spending', kde=True, palette='coolwarm', ax=ax[0])
sns.boxplot(data=PROFILE,y='total_spending', ax=ax[1])
sns.boxplot(data=PROFILE,y='total_offers', ax=ax[2])
plt.show()
After looking at these distributions I set the following limits:
- total_spending: based on $Q_3 + 1.5 \times IQR$, where IQR is the Inter Quartile Range. This is the traditional limit for outliers, represented by the upper whisker on the boxplot.
- total_offers: at 3 offers (same method to find outliers, but obvious on the boxplot)
# compute the Inter Quartile Range for total_spending to filter out outliers
q3 = PROFILE.total_spending.quantile(0.75)
q1 = PROFILE.total_spending.quantile(0.25)
iqr = q3-q1
upper_bound = q3+1.5*iqr
# option to filter or not
if True:
P_filt = PROFILE.query('total_offers >= 3')
P_filt = P_filt.query('5 <total_spending < @upper_bound')
else:
P_filt = PROFILE
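The same rule on a toy series behaves as expected (illustrative values only):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])   # one obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)     # 3.0 and 7.0 here
upper_bound = q3 + 1.5 * (q3 - q1)              # the boxplot's upper whisker
outliers = s[s > upper_bound]                   # only the value 100 survives
```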
f, ax = plt.subplots(ncols=2,figsize=[16,8], sharey=True)
plt.subplots_adjust(wspace=.1)
plt.suptitle("Distribution of total spending after filtering")
d = sns.histplot(
data=P_filt,
y='total_spending',
hue='gender',
kde=True,
ax=ax[0],
palette='coolwarm'
)
b = sns.boxplot(
data=P_filt,
y='total_spending',
x='gender',
ax=ax[1],
palette='coolwarm'
)
ax[0].set_ylabel('Total spending over 30 days ($)')
ax[1].set_ylabel('')
plt.show()
I then keep in our results dataframe only the customers present in our filtered group.
# find id of customers remaining after filtering
idx_filt = list(set(RES.profile_id).intersection(P_filt.index))
# keep only customers from filtered group
RES_filt = RES.query('profile_id in @idx_filt')
# group result table by profile_id and portfolio_type
df = RES_filt.groupby(['profile_id', 'portfolio_type'])[['complete_tag', 'amount_viewed']]\
.mean()\
.reset_index('portfolio_type')
print(f'number of customers: {len(df)}')
df.head()
number of customers: 32084
| portfolio_type | complete_tag | amount_viewed | |
|---|---|---|---|
| profile_id | |||
| 0009655768c64bdeb2e877511632db8f | bogo | 0.0 | 14.110 |
| 0009655768c64bdeb2e877511632db8f | discount | 0.0 | 41.380 |
| 0009655768c64bdeb2e877511632db8f | informational | 0.0 | 15.365 |
| 0011e0d4e6b944f998e987f904e8c1e5 | bogo | 1.0 | 54.040 |
| 0011e0d4e6b944f998e987f904e8c1e5 | discount | 1.0 | 28.505 |
amount_viewed
In this section I am testing whether the amount_viewed is indeed responsive to the metric of success that we used in the previous analysis: the conversion rate.
# as computed before, I declare an offer successfully converted if a customer has an average conversion > 50% for that offer
# (x != x is True only for NaN, which is left as None)
df['success'] = df['complete_tag'].apply(lambda x: None if x != x else int(x > 0.5))
from scipy.stats import ttest_ind, ttest_rel
for offer_name in ['bogo', 'discount']:
    # keep only the rows for this offer type
    a = df.query('(success == 0) & (portfolio_type == @offer_name)')['amount_viewed']
    b = df.query('(success == 1) & (portfolio_type == @offer_name)')['amount_viewed']
test = ttest_ind(a, b, alternative='less')
print('----',f'\nOffer: {offer_name.upper()}', f'\nT-value= {test.statistic:.3f}; p = {test.pvalue}\n')
sns.boxplot(data=df, x='success', y='amount_viewed')
plt.show()
---- Offer: BOGO T-value= -101.636; p = 0.0
---- Offer: DISCOUNT T-value= -101.117; p = 0.0
The T-test comparing the amount_viewed between the unsuccessful and the successful group clearly rejects the null hypothesis (p < 0.05). The unsuccessful group shows significantly lower spending after viewing an offer than the successful group.
I will use the amount_viewed to build a model of performance for each offer type based on the demographics and the total_spending.
To make the model more accurate I decided to:
- convert the became_member_on type from datetime to a Unix timestamp (seconds since January 1st, 1970) to make it numerical
from code.data_modeling import filter_by_offer
offer_name = 'discount'
X, y = filter_by_offer(df, offer_name)
# choosing the transformation for normalization
f, ax = plt.subplots(ncols=3, figsize=[16,8])
sns.histplot(y, kde=True, ax=ax[0]); ax[0].set_title('Original')
sns.histplot(np.sqrt(y), kde=True, color='red', ax=ax[1]); ax[1].set_title('Square Root')
sns.histplot(np.log10(y), kde=True, color='green', ax=ax[2]); ax[2].set_title('Log(10)')
plt.suptitle('Target Variable Normalization', weight='bold', fontsize=16)
plt.show()
The distribution of amount_viewed for the discount offer is clearly skewed, so I explored the possibility of normalizing it to make it more normal. I will test a linear regression model on the 3 distributions.
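run_model lives in code.data_modeling; judging by the grid-search keys it prints, it presumably wraps a regularized linear regressor in a TransformedTargetRegressor. A numpy-only sketch of that idea, fitting ridge regression on a transformed target and inverting the transform at prediction time (this is an assumption about the internals, not the project's actual code):

```python
import numpy as np

def ridge_fit(X, y, alpha=0.1):
    # closed-form ridge: w = (Xb'Xb + alpha*I)^-1 Xb'y, with a bias column
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.linalg.solve(Xb.T @ Xb + alpha * np.eye(Xb.shape[1]), Xb.T @ y)

def ridge_predict(X, w):
    return np.hstack([X, np.ones((len(X), 1))]) @ w

def run_transformed(X, y, func=np.sqrt, inverse=np.square, alpha=0.1):
    # fit in the transformed target space, predict back in the original space
    w = ridge_fit(X, func(y), alpha)
    return inverse(ridge_predict(X, w))
```

With `func=np.log10` the inverse would be `lambda t: 10 ** t`, mirroring the three transformations tested in the cells that follow.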
from code.data_modeling import run_model
# testing with no normalization
run_model(X, y)
Building model...
Training model...
{'transformedtargetregressor__regressor__alpha': 0.1,
'transformedtargetregressor__regressor__max_iter': None,
'transformedtargetregressor__regressor__tol': 0.001}
Best Score:62.07%
Evaluating model...
This model explains 62.57% of the variance of the amount_viewed
Our predictions are wrong by 76.96%
Which represent on average $12.57
# testing with sqrt normalization
run_model(X, y, 'sqrt')
Building model...
Training model...
{'transformedtargetregressor__regressor__alpha': 0.1,
'transformedtargetregressor__regressor__max_iter': None,
'transformedtargetregressor__regressor__tol': 0.001}
Best Score:59.99%
Evaluating model...
This model explains 61.18% of the variance of the amount_viewed
Our predictions are wrong by 81.11%
Which represent on average $12.44
# testing with log10 normalization
run_model(X, y, 'log10')
Building model...
Training model...
{'transformedtargetregressor__regressor__alpha': 0.1,
'transformedtargetregressor__regressor__max_iter': None,
'transformedtargetregressor__regressor__tol': 0.001}
Best Score:15.46%
Evaluating model...
This model explains 23.23% of the variance of the amount_viewed
Our predictions are wrong by 99.06%
Which represent on average $9.18
for offer_name in ['bogo', 'discount', 'informational']:
X, y = filter_by_offer(df, offer_name)
print('--------------', f'\nOFFER: {offer_name.upper()}')
run_model(X, y)
--------------
OFFER: BOGO
Building model...
Training model...
{'transformedtargetregressor__regressor__alpha': 0.1,
'transformedtargetregressor__regressor__max_iter': None,
'transformedtargetregressor__regressor__tol': 0.001}
Best Score:58.57%
Evaluating model...
This model explains 58.29% of the variance of the amount_viewed
Our predictions are wrong by 102.15%
Which represent on average $10.29
--------------
OFFER: DISCOUNT
Building model...
Training model...
{'transformedtargetregressor__regressor__alpha': 0.1,
'transformedtargetregressor__regressor__max_iter': None,
'transformedtargetregressor__regressor__tol': 0.001}
Best Score:62.45%
Evaluating model...
This model explains 61.12% of the variance of the amount_viewed
Our predictions are wrong by 82.92%
Which represent on average $12.76
--------------
OFFER: INFORMATIONAL
Building model...
Training model...
{'transformedtargetregressor__regressor__alpha': 0.1,
'transformedtargetregressor__regressor__max_iter': None,
'transformedtargetregressor__regressor__tol': 0.001}
Best Score:42.17%
Evaluating model...
This model explains 45.37% of the variance of the amount_viewed
Our predictions are wrong by 129.80%
Which represent on average $8.91